Abstract

We study the issue of PAC-Bayesian domain adaptation: We want to learn, from a source domain, a majority vote model dedicated to a target one. Our theoretical contribution brings a new perspective by deriving an upper bound on the target risk where the distributions' divergence, expressed as a ratio, controls the trade-off between a source error measure and the target voters' disagreement. Our bound suggests that one has to focus on regions where the source data is informative. From this result, we derive a PAC-Bayesian generalization bound, and specialize it to linear classifiers. Then, we infer a learning algorithm and perform experiments on real data.

1. Introduction 

Machine learning practitioners are commonly exposed to the issue of domain adaptation¹ (Jiang, 2008; Margolis, 2011): One usually learns a model from a corpus, i.e., a fixed yet unknown source distribution, then wants to apply it on a new corpus, i.e., a related but slightly different target distribution. Therefore, domain adaptation is widely studied in many application fields such as computer vision (Patel et al., 2015; Ganin & Lempitsky, 2015), bioinformatics (Liu et al., 2008), and natural language processing (Blitzer, 2007; Daume III, 2007). A common example is the spam filtering problem, where a model needs to be adapted from one user mailbox to another receiving significantly different emails. Many approaches exist to address domain adaptation, often with the same idea: If we can apply a transformation to "move closer" the distributions, then we can learn a model with the available labels. This is generally performed by reweighting the importance of labeled data (Huang et al., 2006; Sugiyama et al., 2007; Cortes et al., 2010; 2015), and/or by learning a common representation for the source and target distributions (Chen et al., 2012; Ganin et al., 2016), and/or by minimizing a measure of divergence between the distributions (Morvant et al., 2012; Germain et al., 2013; Cortes & Mohri, 2014). The divergence-based approach has especially been explored to derive generalization bounds for domain adaptation (e.g., Ben-David et al., 2006; 2010; Mansour et al., 2009; Li & Bilmes, 2007; Zhang et al., 2012). Recently, this issue has been studied through the PAC-Bayesian framework (Germain et al., 2013), which focuses on learning weighted majority votes² without target labels. Even though the latter result opened the door to tackling domain adaptation in a PAC-Bayesian fashion, it shares the same philosophy as the seminal works of Ben-David et al. (2006; 2010) and Mansour et al. (2009): The risk of the target model is upper-bounded jointly by the model's risk on the source distribution, the divergence between the marginal distributions, and a non-estimable term³ related to the ability to adapt in the current space. Note that Li & Bilmes (2007) proposed a PAC-Bayesian generalization bound for domain adaptation, but they considered target labels.

¹Domain adaptation is associated with transfer learning (Pan & Yang, 2010; Quionero-Candela et al., 2009).

²This setting is not too restrictive since many algorithms can be seen as majority vote learning. E.g., ensemble learning and kernel methods output models interpretable as majority votes.

³More precisely, this term can only be estimated in the presence of labeled data from both the source and the target domains.
In this paper, we derive a novel domain adaptation bound for the weighted majority vote framework. Concretely, the risk of the target model is still upper-bounded by three terms, but they differ in the information they capture. The first term is estimable from unlabeled data and relies on a notion of expected voters' disagreement on the target domain. The second term depends on the expected accuracy of the voters on the source domain. Interestingly, the latter is weighted by a divergence between the source and the target domains that enables controlling the relationship between domains. The third term estimates the "volume" of the target domain living apart from the source one⁴, which has to be small for ensuring adaptation. From our bound, we deduce that a good adaptation strategy consists in finding a weighted majority vote leading to a suitable trade-off, controlled by the domains' divergence, between the first two terms: Minimizing the first one corresponds to looking for voters that disagree on the target domain, and minimizing the second one to seeking accurate voters on the source. Thereafter, we provide PAC-Bayesian generalization guarantees to justify the empirical minimization of our new domain adaptation bound, and specialize it to linear classifiers (following a methodology known to give rise to tight bound values). This allows us to design DALC, a learning algorithm that improves on the performance of the previous PAC-Bayesian domain adaptation algorithm.

The rest of the paper is organized as follows. Section 2 presents the PAC-Bayesian domain adaptation setting. Section 3 reviews previous theoretical results on domain adaptation. Section 4 states our new analysis of domain adaptation for majority votes, which we relate to other works in Section 5. Then, Section 6 provides generalization bounds, specialized to linear classifiers in Section 7 to motivate the DALC learning algorithm, evaluated in Section 8.


⁴Here we do not focus on learning a new representation to help the adaptation: We directly aim at adapting in the current space.

2. Unsupervised Domain Adaptation Setting

We tackle domain adaptation for binary classification, from a $d$-dimensional input space $X \subseteq \mathbb{R}^d$ to an output space $Y = \{-1, 1\}$. Our goal is to perform domain adaptation from a distribution $\mathcal{S}$ (the source domain) to another (related) distribution $\mathcal{T}$ (the target domain) on $X \times Y$; $\mathcal{S}_X$ and $\mathcal{T}_X$ being the associated marginal distributions on $X$. Given a distribution $\mathcal{D}$, we denote $(\mathcal{D})^m$ the distribution of an $m$-sample constituted by $m$ elements drawn i.i.d. from $\mathcal{D}$. We consider the unsupervised domain adaptation setting in which the algorithm is provided with a labeled source $m_s$-sample $S = \{(x_i, y_i)\}_{i=1}^{m_s} \sim (\mathcal{S})^{m_s}$, and with an unlabeled target $m_t$-sample $T = \{x'_i\}_{i=1}^{m_t} \sim (\mathcal{T}_X)^{m_t}$.

PAC-Bayesian domain adaptation. Our work is inspired by the PAC-Bayesian theory (first introduced by McAllester, 1999). More precisely, we adopt the PAC-Bayesian domain adaptation setting previously studied in Germain et al. (2013). Given $\mathcal{H}$, a set of voters $h : X \to Y$, the elements of this approach are a prior distribution $\pi$ on $\mathcal{H}$, a pair of source-target learning samples $(S, T)$ and a posterior distribution $\rho$ on $\mathcal{H}$. The prior distribution $\pi$ models an a priori belief, held before observing $(S, T)$, of the voters' accuracy. Then, given the information provided by $(S, T)$, we aim at learning a posterior distribution $\rho$ leading to a $\rho$-weighted majority vote over $\mathcal{H}$,

$$B_\rho(\cdot) = \mathrm{sign}\Big[ \mathbf{E}_{h \sim \rho}\, h(\cdot) \Big],$$

with nice generalization guarantees on the target domain $\mathcal{T}$. In other words, we want to find the posterior distribution $\rho$ minimizing the true target risk of $B_\rho$:

$$R_\mathcal{T}(B_\rho) = \mathbf{E}_{(x,y) \sim \mathcal{T}}\, I[B_\rho(x) \neq y],$$

where $I[a] = 1$ if $a$ is true, and 0 otherwise. However, in most PAC-Bayesian analyses one does not directly focus on this majority vote risk, but studies the expectation of the risks over $\mathcal{H}$ according to $\rho$, referred to as the Gibbs risk:

$$R_\mathcal{D}(G_\rho) = \mathbf{E}_{(x,y) \sim \mathcal{D}}\, \mathbf{E}_{h \sim \rho}\, I[h(x) \neq y]. \qquad (1)$$

It is well-known in the PAC-Bayesian literature that $R_\mathcal{D}(B_\rho) \le 2 R_\mathcal{D}(G_\rho)$ (e.g., Herbrich & Graepel, 2000). Unfortunately, this worst case bound often leads to poor generalization guarantees on the majority vote risk. To address this issue, Lacasse et al. (2006) (refined in Germain et al., 2015) have exhibited that one can obtain a tighter bound on $R_\mathcal{D}(B_\rho)$ by studying the expected disagreement $d_\mathcal{D}(\rho)$ of pairs of voters, defined as

$$d_\mathcal{D}(\rho) = \mathbf{E}_{x \sim \mathcal{D}_X}\, \mathbf{E}_{h \sim \rho}\, \mathbf{E}_{h' \sim \rho}\, I[h(x) \neq h'(x)], \qquad (2)$$

as $R_\mathcal{D}(B_\rho) \le 1 - \frac{(1 - 2 R_\mathcal{D}(G_\rho))^2}{1 - 2 d_\mathcal{D}(\rho)}$. Note that, although relying on $d_\mathcal{D}(\rho)$, our present work does not reuse the latter result.⁵ Instead, we adopt another well-known strategy to obtain tight majority vote bounds, by specializing our PAC-Bayesian bound to linear classifiers. We describe this approach, and refer to related works, in Section 7.

⁵The quantity $d_\mathcal{D}(\rho)$ is also used in the domain adaptation bound of Germain et al. (2013) to measure divergence between distributions. See forthcoming Theorem 2.

3. Some Previous Domain Adaptation Bounds

Many approaches tackling domain adaptation share the same underlying "philosophy", originating in the work of Ben-David et al. (2006; 2010), which proposed a domain adaptation bound (Theorem 1, below). To summarize, the domain adaptation bounds reviewed in this section (see Zhang et al., 2012; Cortes et al., 2010; 2015, for other bounds) express a similar trade-off between three terms: (i) the source risk, (ii) the distance between source and target marginal distributions over $X$, (iii) a non-estimable term (without target label) quantifying the difficulty of the task.

Ben-David et al. (2006) assumed that the domains are related in the sense that there exists an (unknown) model performing well on both domains. Formally, their domain adaptation bound depends on the error $\mu_{h^*} = R_\mathcal{S}(h^*) + R_\mathcal{T}(h^*)$ of the best hypothesis overall $h^* = \mathrm{argmin}_{h \in \mathcal{H}} \big( R_\mathcal{S}(h) + R_\mathcal{T}(h) \big)$. In practice, when no target label is available, $\mu_{h^*}$ is non-estimable and is assumed to be low when domain adaptation is achievable (or at least that there exists a representation space in which this assumption can be verified). In such a scenario, the domain adaptation strategy is then to look for a set $\mathcal{H}$ of possible models that behave "similarly" on both the source and target data, and to learn a model in $\mathcal{H}$ with a good accuracy on the source data. This similarity, called the $\mathcal{H}\Delta\mathcal{H}$-distance,

$$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}_X, \mathcal{T}_X) = 2 \sup_{(h,h') \in \mathcal{H}^2} \Big| \mathbf{E}_{x \sim \mathcal{S}_X} I[h(x) \neq h'(x)] - \mathbf{E}_{x \sim \mathcal{T}_X} I[h(x) \neq h'(x)] \Big|,$$

gives rise to the following domain adaptation bound.

Theorem 1 (Ben-David et al., 2006; 2010). Let $\mathcal{H}$ be a (symmetric⁶) hypothesis class. We have,

$$\forall h \in \mathcal{H}, \quad R_\mathcal{T}(h) \le R_\mathcal{S}(h) + \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}_X, \mathcal{T}_X) + \mu_{h^*}. \qquad (3)$$

⁶In a symmetric $\mathcal{H}$, for all $h \in \mathcal{H}$, its inverse $-h$ is also in $\mathcal{H}$.

Pursuing the same line of research, Mansour et al. (2009) generalize the $\mathcal{H}\Delta\mathcal{H}$-distance to real-valued loss functions $\mathcal{L} : [-1, 1]^2 \to \mathbb{R}^+$, to express a similar theorem for regression. Their discrepancy $\mathrm{disc}_\mathcal{L}(\mathcal{S}_X, \mathcal{T}_X)$ is defined as

$$\sup_{(h,h') \in \mathcal{H}^2} \Big| \mathbf{E}_{x \sim \mathcal{S}_X} \mathcal{L}(h(x), h'(x)) - \mathbf{E}_{x \sim \mathcal{T}_X} \mathcal{L}(h(x), h'(x)) \Big|.$$

The accuracy of the bound of Mansour et al. (2009) also relies on a non-estimable term assumed to be low when adaptation is achievable. Roughly, this term depends on the risk of the best target hypothesis and its agreement with the best source hypothesis on the source domain.

Building on previous domain adaptation analyses, Germain et al. (2013) derived a PAC-Bayesian domain adaptation bound. This bound is based on a divergence suitable for PAC-Bayes, i.e., for the risk of a $\rho$-weighted majority vote of the voters of $\mathcal{H}$ (instead of a single classifier $h \in \mathcal{H}$). This domain disagreement $\mathrm{dis}_\rho(\mathcal{S}_X, \mathcal{T}_X)$ is defined as

$$\mathrm{dis}_\rho(\mathcal{S}_X, \mathcal{T}_X) = \big| d_\mathcal{S}(\rho) - d_\mathcal{T}(\rho) \big|. \qquad (4)$$

Theorem 2 (below) needs the strong assumption that, in favorable adaptation situations, the learned posterior agrees with the best target one $\rho^*_\mathcal{T} = \mathrm{argmin}_\rho R_\mathcal{T}(G_\rho)$. Indeed, it relies on the following non-estimable term:

$$\lambda(\rho) = R_\mathcal{T}(G_{\rho^*_\mathcal{T}}) + \mathbf{E}_{h \sim \rho}\, \mathbf{E}_{h' \sim \rho^*_\mathcal{T}}\, \mathbf{E}_{x \sim \mathcal{S}_X} I[h(x) \neq h'(x)] + \dots$$

Theorem 2 (Germain et al., 2013). Let $\mathcal{H}$ be a set of voters. For any domains $\mathcal{S}$ and $\mathcal{T}$ over $X \times Y$, we have,

$$\forall \rho \text{ on } \mathcal{H}, \quad R_\mathcal{T}(G_\rho) \le R_\mathcal{S}(G_\rho) + \mathrm{dis}_\rho(\mathcal{S}_X, \mathcal{T}_X) + \lambda(\rho).$$

A compelling aspect of this PAC-Bayesian analysis is the suggested trade-off, which is a function of $\rho$. Indeed, given a fixed instance space $X$ and a fixed class $\mathcal{H}$, apart from using importance weighting methods, the only way to minimize the bound of Theorem 1 is to find $h \in \mathcal{H}$ that minimizes $R_\mathcal{S}(h)$. In Germain et al. (2013), the bound of Theorem 2 inspired an algorithm, named PBDA, selecting $\rho$ over $\mathcal{H}$ that achieves a trade-off between $R_\mathcal{S}(G_\rho)$ and $\mathrm{dis}_\rho(\mathcal{S}_X, \mathcal{T}_X)$. However, the term $\lambda(\rho)$ does not appear in the optimization process of PBDA, even though it relies on the learned weight distribution $\rho$. It is assumed that the value of $\lambda(\rho)$ should be negligible (uniformly for all $\rho$) when adaptation is achievable. Nevertheless, this strong assumption cannot be verified because the best target posterior distribution $\rho^*_\mathcal{T}$ is unknown. This is a major weakness of the previous PAC-Bayesian work that our new approach overcomes.

4. A New Domain Adaptation Perspective

In this section, we introduce an original approach to upper-bound the non-estimable risk of a $\rho$-weighted majority vote on a target distribution $\mathcal{T}$ thanks to a term depending on its marginal distribution $\mathcal{T}_X$, another one on a related source domain $\mathcal{S}$, and a term capturing the "volume" of the source distribution uninformative for the target task. We base our bound on the expected disagreement $d_\mathcal{D}(\rho)$ of Equation (2) and the expected joint error $e_\mathcal{D}(\rho)$, defined as

$$e_\mathcal{D}(\rho) = \mathbf{E}_{(x,y) \sim \mathcal{D}}\, \mathbf{E}_{h \sim \rho}\, \mathbf{E}_{h' \sim \rho}\, I[h(x) \neq y]\, I[h'(x) \neq y]. \qquad (5)$$

Indeed, Lacasse et al. (2006); Germain et al. (2015) observed that, given a domain $\mathcal{D}$ on $X \times Y$ and a distribution $\rho$ on $\mathcal{H}$, we can decompose the Gibbs risk as

$$R_\mathcal{D}(G_\rho) = \tfrac{1}{2}\, \mathbf{E}_{(x,y) \sim \mathcal{D}}\, \mathbf{E}_{h \sim \rho}\, \mathbf{E}_{h' \sim \rho} \Big( I[h(x) \neq y] + I[h'(x) \neq y] \Big)$$
$$= \mathbf{E}_{(x,y) \sim \mathcal{D}}\, \mathbf{E}_{h \sim \rho}\, \mathbf{E}_{h' \sim \rho}\, \frac{I[h(x) \neq h'(x)] + 2\, I[h(x) \neq y \wedge h'(x) \neq y]}{2}$$
$$= \tfrac{1}{2}\, d_\mathcal{D}(\rho) + e_\mathcal{D}(\rho). \qquad (6)$$

A key observation is that the voters' disagreement does not rely on labels; we can compute $d_\mathcal{D}(\rho)$ using the marginal distribution $\mathcal{D}_X$. Thus, in the present domain adaptation context, we have access to $d_\mathcal{T}(\rho)$ even if the target labels are unknown. However, the expected joint error can only be computed on the labeled source domain.
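The decomposition of the Gibbs risk into half the expected disagreement plus the expected joint error can be checked numerically on a finite sample. The sketch below is only an illustration: the helper name `gibbs_quantities`, the random ±1 voters, and the Dirichlet-drawn posterior are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def gibbs_quantities(votes, y, rho):
    """Return (Gibbs risk, expected disagreement, expected joint error).

    votes: (n_voters, n_samples) array of +/-1 predictions (hypothetical voters)
    y:     (n_samples,) array of +/-1 labels
    rho:   (n_voters,) posterior weights summing to 1
    """
    errors = (votes != y).astype(float)      # I[h(x) != y] for each voter/example
    err_prob = rho @ errors                  # P_{h~rho}(h(x) != y) per example
    gibbs_risk = float(err_prob.mean())
    # h and h' are drawn independently from rho, so the joint error factorizes.
    joint_error = float(np.mean(err_prob ** 2))
    # For +/-1 votes, P(h(x) != h'(x)) = 2 p (1 - p) with p = P(h(x) = +1).
    p_pos = rho @ (votes == 1).astype(float)
    disagreement = float(np.mean(2.0 * p_pos * (1.0 - p_pos)))
    return gibbs_risk, disagreement, joint_error

# Illustrative random data (assumption: 5 voters, 200 examples).
rng = np.random.default_rng(0)
votes = rng.choice([-1, 1], size=(5, 200))
y = rng.choice([-1, 1], size=200)
rho = rng.dirichlet(np.ones(5))

risk, dis, joint = gibbs_quantities(votes, y, rho)
print(risk, 0.5 * dis + joint)  # the two values agree up to rounding
```

Note that the disagreement is computed from the votes alone, which mirrors the observation that this quantity needs no labels.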

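Domains' divergence. In order to link the target joint error $e_\mathcal{T}(\rho)$ with the source one $e_\mathcal{S}(\rho)$, we weight the latter thanks to a divergence measure between the domains $\beta_q(\mathcal{T} \| \mathcal{S})$ parametrized by a real value $q > 0$:

$$\beta_q(\mathcal{T} \| \mathcal{S}) = \left[ \mathbf{E}_{(x,y) \sim \mathcal{S}} \left( \frac{\mathcal{T}(x,y)}{\mathcal{S}(x,y)} \right)^{q} \right]^{\frac{1}{q}}. \qquad (7)$$

It is worth noting that considering some $q$ values allows us to recover well-known divergences. For instance, choosing $q = 2$ relates our result to the $\chi^2$-distance, as $\beta_2(\mathcal{T} \| \mathcal{S}) = \sqrt{\chi^2(\mathcal{T} \| \mathcal{S}) + 1}$. Moreover, we can link $\beta_q(\mathcal{T} \| \mathcal{S})$ to the Rényi divergence⁷, which has led to generalization bounds in the context of importance weighting (Cortes et al., 2010). We denote the limit case $q \to \infty$ by

$$\beta_\infty(\mathcal{T} \| \mathcal{S}) = \sup_{(x,y) \in \mathrm{supp}(\mathcal{S})} \left( \frac{\mathcal{T}(x,y)}{\mathcal{S}(x,y)} \right),$$

with $\mathrm{supp}(\mathcal{S})$ the support of $\mathcal{S}$. The divergence $\beta_q(\mathcal{T} \| \mathcal{S})$ handles the input space areas where the source domain support $\mathrm{supp}(\mathcal{S})$ is included in the target one $\mathrm{supp}(\mathcal{T})$. It seems reasonable to assume that, when adaptation is achievable, such areas are fairly large. However, it is likely that $\mathrm{supp}(\mathcal{T})$ is not entirely included in $\mathrm{supp}(\mathcal{S})$. We denote $\mathcal{T} \setminus \mathcal{S}$ the distribution of $(x,y) \sim \mathcal{T}$ conditional to $(x,y) \in \mathrm{supp}(\mathcal{T}) \setminus \mathrm{supp}(\mathcal{S})$. Since it is hardly conceivable to estimate the joint error $e_{\mathcal{T} \setminus \mathcal{S}}(\rho)$ without making extra assumptions, we define the worst risk for this unknown area

$$\eta_{\mathcal{T} \setminus \mathcal{S}} = \Pr_{(x,y) \sim \mathcal{T}} \big( (x,y) \notin \mathrm{supp}(\mathcal{S}) \big)\, \sup_{h \in \mathcal{H}} R_{\mathcal{T} \setminus \mathcal{S}}(h). \qquad (8)$$

Even if we cannot evaluate $\sup_{h \in \mathcal{H}} R_{\mathcal{T} \setminus \mathcal{S}}(h)$, the value of $\eta_{\mathcal{T} \setminus \mathcal{S}}$ is necessarily at most $\Pr_{(x,y) \sim \mathcal{T}} \big( (x,y) \notin \mathrm{supp}(\mathcal{S}) \big)$.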
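For distributions over a finite set, the divergence $\beta_q$ reduces to a weighted sum, which makes the special cases easy to inspect. The following sketch assumes discrete probability vectors `s` and `t` over the same finite set of points; the function names `beta_q` and `chi2` and the example vectors are hypothetical and chosen only for illustration.

```python
import numpy as np

def beta_q(t, s, q):
    """(E_{z~S} (T(z)/S(z))^q)^(1/q), computed on the support of S only."""
    t, s = np.asarray(t, dtype=float), np.asarray(s, dtype=float)
    mask = s > 0
    ratio = t[mask] / s[mask]
    if np.isinf(q):
        return float(ratio.max())            # limit case q -> infinity
    return float(np.sum(s[mask] * ratio ** q) ** (1.0 / q))

def chi2(t, s):
    """Chi-squared distance sum_z (T(z) - S(z))^2 / S(z) on the support of S."""
    t, s = np.asarray(t, dtype=float), np.asarray(s, dtype=float)
    mask = s > 0
    return float(np.sum((t[mask] - s[mask]) ** 2 / s[mask]))

# Illustrative source and target distributions sharing the same support.
s = np.array([0.5, 0.3, 0.2])
t = np.array([0.2, 0.3, 0.5])

print(beta_q(t, s, 2), np.sqrt(chi2(t, s) + 1.0))  # equal when supports coincide
print(beta_q(t, s, np.inf))                        # largest density ratio
```

With shared supports, the value for $q = 2$ matches $\sqrt{\chi^2 + 1}$, and the values grow with $q$ toward the largest density ratio.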

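The domain adaptation bound. Let us state the result underlying the domain adaptation perspective of this paper.

Theorem 3. Let $\mathcal{H}$ be a hypothesis space, let $\mathcal{S}$ and $\mathcal{T}$ respectively be the source and the target domains on $X \times Y$. Let $q > 0$ be a constant. We have, for all $\rho$ on $\mathcal{H}$,

$$R_\mathcal{T}(G_\rho) \le \tfrac{1}{2}\, d_\mathcal{T}(\rho) + \beta_q(\mathcal{T} \| \mathcal{S}) \times \big[ e_\mathcal{S}(\rho) \big]^{1 - \frac{1}{q}} + \eta_{\mathcal{T} \setminus \mathcal{S}},$$

where $d_\mathcal{T}(\rho)$, $e_\mathcal{S}(\rho)$, $\beta_q(\mathcal{T} \| \mathcal{S})$ and $\eta_{\mathcal{T} \setminus \mathcal{S}}$ are respectively defined by Equations (2), (5), (7) and (8).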


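Proof. Let us define $t = \mathbf{E}_{(x,y) \sim \mathcal{T}}\, I[(x,y) \notin \mathrm{supp}(\mathcal{S})]$, then

$$\mathbf{E}_{(x,y) \sim \mathcal{T}}\, I[(x,y) \notin \mathrm{supp}(\mathcal{S})]\, \mathbf{E}_{h \sim \rho}\, \mathbf{E}_{h' \sim \rho}\, I[h(x) \neq y]\, I[h'(x) \neq y]$$
$$= t\, \mathbf{E}_{(x,y) \sim \mathcal{T} \setminus \mathcal{S}}\, \mathbf{E}_{h \sim \rho}\, \mathbf{E}_{h' \sim \rho}\, I[h(x) \neq y]\, I[h'(x) \neq y] = t\, e_{\mathcal{T} \setminus \mathcal{S}}(\rho)$$
$$= t \Big( R_{\mathcal{T} \setminus \mathcal{S}}(G_\rho) - \tfrac{1}{2}\, d_{\mathcal{T} \setminus \mathcal{S}}(\rho) \Big) \le t \sup_{h \in \mathcal{H}} R_{\mathcal{T} \setminus \mathcal{S}}(h) = \eta_{\mathcal{T} \setminus \mathcal{S}}.$$

Then, with $\beta_q = \beta_q(\mathcal{T} \| \mathcal{S})$ and $p$ such that $\frac{1}{p} = 1 - \frac{1}{q}$,

$$e_\mathcal{T}(\rho) = \mathbf{E}_{(x,y) \sim \mathcal{T}}\, \mathbf{E}_{h \sim \rho}\, \mathbf{E}_{h' \sim \rho}\, I[h(x) \neq y]\, I[h'(x) \neq y]$$
$$\le \mathbf{E}_{(x,y) \sim \mathcal{S}}\, \frac{\mathcal{T}(x,y)}{\mathcal{S}(x,y)}\, \mathbf{E}_{h \sim \rho}\, \mathbf{E}_{h' \sim \rho}\, I[h(x) \neq y]\, I[h'(x) \neq y] + \eta_{\mathcal{T} \setminus \mathcal{S}} \qquad (9)$$
$$\le \beta_q \Big[ \mathbf{E}_{h \sim \rho}\, \mathbf{E}_{h' \sim \rho}\, \mathbf{E}_{(x,y) \sim \mathcal{S}} \big( I[h(x) \neq y]\, I[h'(x) \neq y] \big)^{p} \Big]^{\frac{1}{p}} + \eta_{\mathcal{T} \setminus \mathcal{S}}.$$

The last line is due to the Hölder inequality. Finally, we remove the exponent from the expression $\big( I[h(x) \neq y]\, I[h'(x) \neq y] \big)^{p}$ without affecting its value, which is either 1 or 0, and the final result follows from Equation (6). □

⁷For $q \ge 0$, we can show $\beta_q(\mathcal{T} \| \mathcal{S}) = 2^{\frac{q-1}{q} D_q(\mathcal{T} \| \mathcal{S})}$, where $D_q(\mathcal{T} \| \mathcal{S})$ is the Rényi divergence between $\mathcal{T}$ and $\mathcal{S}$.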


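Note that the bound of Theorem 3 is reached whenever the domains are equal ($\mathcal{S} = \mathcal{T}$). Thus, when adaptation is not necessary, our analysis is still sound and non-degenerate. Taking $q \to \infty$, we have $\beta_\infty(\mathcal{T} \| \mathcal{S}) = 1$ and $\eta_{\mathcal{T} \setminus \mathcal{S}} = 0$, so that

$$R_\mathcal{S}(G_\rho) = R_\mathcal{T}(G_\rho) \le \tfrac{1}{2}\, d_\mathcal{T}(\rho) + 1 \times \big[ e_\mathcal{S}(\rho) \big]^{1} + 0 = \tfrac{1}{2}\, d_\mathcal{S}(\rho) + e_\mathcal{S}(\rho) = R_\mathcal{S}(G_\rho).$$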
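The statement of the bound, including the equality case for identical domains, can be explored on a small finite domain where every quantity is computable exactly. The sketch below is an illustration under stated assumptions: points carry deterministic labels, the voters are a finite random set, the supremum in the worst-risk term is taken over that finite set, and all function names are hypothetical. It only tries $q > 1$ and $q \to \infty$, the range in which the Hölder step has a conjugate exponent.

```python
import numpy as np

def _err_prob(votes, y, rho):
    # Probability that a rho-drawn voter errs on each point.
    return rho @ (votes != y).astype(float)

def gibbs_risk(p, votes, y, rho):
    return float(p @ _err_prob(votes, y, rho))

def joint_error(p, votes, y, rho):
    return float(p @ (_err_prob(votes, y, rho) ** 2))

def disagreement(p, votes, rho):
    p_pos = rho @ (votes == 1).astype(float)
    return float(p @ (2.0 * p_pos * (1.0 - p_pos)))

def beta_q(t, s, q):
    mask = s > 0
    ratio = t[mask] / s[mask]
    if np.isinf(q):
        return float(ratio.max())
    return float(np.sum(s[mask] * ratio ** q) ** (1.0 / q))

def eta_outside(s, t, votes, y):
    """Target mass outside supp(S) times the worst voter risk on that region."""
    out = s == 0
    mass = t[out].sum()
    if mass == 0:
        return 0.0
    cond = t[out] / mass
    voter_risks = (votes[:, out] != y[out]).astype(float) @ cond
    return float(mass * voter_risks.max())

def theorem3_bound(s, t, votes, y, rho, q):
    exponent = 1.0 if np.isinf(q) else 1.0 - 1.0 / q
    return (0.5 * disagreement(t, votes, rho)
            + beta_q(t, s, q) * joint_error(s, votes, y, rho) ** exponent
            + eta_outside(s, t, votes, y))

# Illustrative finite domain: 12 labeled points, 6 random voters.
rng = np.random.default_rng(0)
n_points, n_voters = 12, 6
y = rng.choice([-1, 1], size=n_points)
votes = rng.choice([-1, 1], size=(n_voters, n_points))
rho = rng.dirichlet(np.ones(n_voters))
s = rng.dirichlet(np.ones(n_points))
s[-2:] = 0.0                      # two points lie outside the source support
s /= s.sum()
t = rng.dirichlet(np.ones(n_points))

for q in (2.0, 5.0, np.inf):
    print(q, gibbs_risk(t, votes, y, rho), theorem3_bound(s, t, votes, y, rho, q))
```

Calling `theorem3_bound(s, s, votes, y, rho, np.inf)` reproduces the source Gibbs risk, which matches the equality case discussed above.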

Meaningful quantities. Similarly to the previous results recalled in Section 3, our domain adaptation theorem bounds the target risk by a sum of three terms. However, our approach breaks the problem into atypical quantities: (i) The expected disagreement $d_\mathcal{T}(\rho)$ captures second degree information about the target domain. (ii) The domains' divergence $\beta_q(\mathcal{T} \| \mathcal{S})$ weights the influence of the expected joint error $e_\mathcal{S}(\rho)$ of the source domain; the parameter $q$ allows us to consider different relationships between $\beta_q(\mathcal{T} \| \mathcal{S})$ and $e_\mathcal{S}(\rho)$. (iii) The term $\eta_{\mathcal{T} \setminus \mathcal{S}}$ quantifies the worst feasible target error on the regions where the source domain is uninformative for the target one. In the current work, we assume that this area is small.

5. Comparison With Related Works 

In this section, we discuss how our domain adaptation 
bound can be related to some previous works. 

5.1. On the previous PAC-Bayesian bound 

It is instructive to compare the new bound of Theorem 3 with the previous PAC-Bayesian domain adaptation bound of Theorem 2. In Theorem 3, the non-estimable terms are the domain divergence $\beta_q(\mathcal{T} \| \mathcal{S})$ and the term $\eta_{\mathcal{T} \setminus \mathcal{S}}$. Contrary to the non-controllable term $\lambda(\rho)$ of Theorem 2, these terms do not depend on the learned posterior distribution $\rho$: For every $\rho$ on $\mathcal{H}$, $\beta_q(\mathcal{T} \| \mathcal{S})$ and $\eta_{\mathcal{T} \setminus \mathcal{S}}$ are constant values measuring the relation between the domains. Moreover, the fact that the domain divergence $\beta_q(\mathcal{T} \| \mathcal{S})$ is not an additive term but a multiplicative one (as opposed to $\mathrm{dis}_\rho(\mathcal{S}_X, \mathcal{T}_X) + \lambda(\rho)$ in Theorem 2) is a contribution of our new analysis. Consequently, $\beta_q(\mathcal{T} \| \mathcal{S})$ can be viewed as a hyperparameter allowing us to tune the trade-off between the target voters' disagreement and the source joint error. Experiments of Section 8 confirm that this hyperparameter can be successfully selected.
5.2. On some domain adaptation assumptions

In order to characterize which domain adaptation task may be learnable, Ben-David et al. (2012) presented three assumptions that can help domain adaptation. Our Theorem 3 does not rely on these assumptions, but they can be interpreted in our framework as discussed below.

On the covariate shift. A domain adaptation task fulfills the covariate shift assumption (Shimodaira, 2000) if the source and target domains only differ in their marginals according to the input space, i.e., $\mathcal{T}_{Y|x}(y) = \mathcal{S}_{Y|x}(y)$. In this scenario, one may estimate $\beta_q(\mathcal{T}_X \| \mathcal{S}_X)$, and even $\eta_{\mathcal{T} \setminus \mathcal{S}}$, by using unsupervised density estimation methods. Interestingly, by also assuming that the domains share the same support, we have $\eta_{\mathcal{T} \setminus \mathcal{S}} = 0$. Then from Line (9) we obtain

$$R_\mathcal{T}(G_\rho) = \tfrac{1}{2}\, d_\mathcal{T}(\rho) + \mathbf{E}_{(x,y) \sim \mathcal{S}}\, \frac{\mathcal{T}_X(x)}{\mathcal{S}_X(x)}\, \mathbf{E}_{h \sim \rho}\, \mathbf{E}_{h' \sim \rho}\, I[h(x) \neq y]\, I[h'(x) \neq y],$$

which suggests a way to correct the shift between the domains by reweighting the labeled source distribution, while considering the information from the target disagreement.

On the weight ratio. The weight ratio (Ben-David et al., 2012) of source and target domains, with respect to a collection of input space subsets $\mathcal{B} \subseteq 2^X$, is given by

$$C_\mathcal{B}(\mathcal{S}, \mathcal{T}) = \inf_{b \in \mathcal{B},\ \mathcal{T}_X(b) \neq 0} \frac{\mathcal{S}_X(b)}{\mathcal{T}_X(b)}.$$

When $C_\mathcal{B}(\mathcal{S}, \mathcal{T})$ is bounded away from 0, adaptation should be achievable under covariate shift. In this context, and when $\mathrm{supp}(\mathcal{S}) = \mathrm{supp}(\mathcal{T})$, the limit case of $\beta_\infty(\mathcal{T} \| \mathcal{S})$ is equal to the inverse of the point-wise weight ratio obtained by letting $\mathcal{B} = \{\{x\} : x \in X\}$ in $C_\mathcal{B}(\mathcal{S}, \mathcal{T})$. Indeed, both $\beta_q$ and $C_\mathcal{B}$ compare the densities of source and target domains, but provide distinct strategies to relax the point-wise weight ratio; the former by lowering the value of $q$ and the latter by considering larger subspaces $\mathcal{B}$.

On the cluster assumption. A target domain fulfills the cluster assumption when examples of the same label belong to a common "area" of the input space, and the differently labeled "areas" are well separated by low-density regions (formalized by the probabilistic Lipschitzness of Urner et al., 2011). Once specialized to linear classifiers, $d_\mathcal{T}(\rho)$ behaves nicely in this context (see Section 7).

5.3. On representation learning

The main assumption underlying our domain adaptation algorithm exhibited in Section 7 is that the support of the target domain is mostly included in the support of the source domain, i.e., the value of the term $\eta_{\mathcal{T} \setminus \mathcal{S}}$ is small. When $\mathcal{T} \setminus \mathcal{S}$ is sufficiently large to prevent proper adaptation, one could try to reduce its volume while taking care to preserve a good compromise between $d_\mathcal{T}(\rho)$ and $e_\mathcal{S}(\rho)$, using a representation learning approach, i.e., by projecting source and target examples into a new common space, as done for example by Chen et al. (2012); Ganin et al. (2016).

6. PAC-Bayesian Generalization Guarantees

To compute our domain adaptation bound, one needs to know the distributions $\mathcal{S}$ and $\mathcal{T}_X$, which is never the case in real life tasks. The PAC-Bayesian theory provides tools to convert the bound of Theorem 3 into a generalization bound on the target risk computable from a pair of source-target samples $(S, T) \sim (\mathcal{S})^{m_s} \times (\mathcal{T}_X)^{m_t}$. To achieve this, we first provide generalization guarantees for $d_\mathcal{T}(\rho)$ and $e_\mathcal{S}(\rho)$. These results are presented as corollaries of Theorem 4 below, which generalizes a PAC-Bayesian theorem of Catoni (2007) to arbitrary loss functions.⁸ Indeed, Theorem 4, with $\ell(h, x, y) = I[h(x) \neq y]$ and Equation (1), gives the usual bound on the Gibbs risk.

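Theorem 4. For any domain $\mathcal{D}$ over $X \times Y$, any set of voters $\mathcal{H}$, any prior $\pi$ over $\mathcal{H}$, any loss $\ell : \mathcal{H} \times X \times Y \to [0, 1]$, any real number $c > 0$, with a probability at least $1 - \delta$ over the choice of $\{(x_i, y_i)\}_{i=1}^{m} \sim (\mathcal{D})^m$, we have for all $\rho$ on $\mathcal{H}$:

$$\mathbf{E}_{(x,y) \sim \mathcal{D}}\, \mathbf{E}_{h \sim \rho}\, \ell(h, x, y) \le \frac{c}{1 - e^{-c}} \left[ \frac{1}{m} \sum_{i=1}^{m} \mathbf{E}_{h \sim \rho}\, \ell(h, x_i, y_i) + \frac{\mathrm{KL}(\rho \| \pi) + \ln \frac{1}{\delta}}{m \times c} \right].$$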

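Note that, similarly to McAllester & Keshet (2011), we could choose to restrict $c \in (0, 2)$ to obtain a slightly looser but simpler bound. Using $e^{-c} \le 1 - c + \frac{1}{2} c^2$, an upper bound on the right hand side of the above equation is given by

$$\frac{1}{1 - \frac{1}{2} c} \left[ \frac{1}{m} \sum_{i=1}^{m} \mathbf{E}_{h \sim \rho}\, \ell(h, x_i, y_i) + \frac{\mathrm{KL}(\rho \| \pi) + \ln \frac{1}{\delta}}{m \times c} \right].$$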

We now exploit Theorem 4 to obtain generalization guarantees on the expected disagreement and the expected joint error. PAC-Bayesian bounds on these quantities appeared in Germain et al. (2015), but under different forms. In Corollary 5 below, we are especially interested in the possibility of controlling the trade-off between the empirical estimate computed on the samples and the complexity term $\mathrm{KL}(\rho \| \pi)$, with the help of parameters $b$ and $c$.


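Corollary 5. For any domains $\mathcal{S}$ and $\mathcal{T}$ over $X \times Y$, any set of voters $\mathcal{H}$, any prior $\pi$ over $\mathcal{H}$, any $\delta \in (0, 1]$, any real numbers $b > 0$ and $c > 0$, we have:

- with a probability at least $1 - \delta$ over $T \sim (\mathcal{T}_X)^{m_t}$,

$$\forall \rho \text{ on } \mathcal{H}, \quad d_\mathcal{T}(\rho) \le \frac{c}{1 - e^{-c}} \left[ \hat{d}_T(\rho) + \frac{2\,\mathrm{KL}(\rho \| \pi) + \ln \frac{1}{\delta}}{m_t \times c} \right],$$

- with a probability at least $1 - \delta$ over $S \sim (\mathcal{S})^{m_s}$,

$$\forall \rho \text{ on } \mathcal{H}, \quad e_\mathcal{S}(\rho) \le \frac{b}{1 - e^{-b}} \left[ \hat{e}_S(\rho) + \frac{2\,\mathrm{KL}(\rho \| \pi) + \ln \frac{1}{\delta}}{m_s \times b} \right],$$

where $\hat{d}_T(\rho)$ and $\hat{e}_S(\rho)$ are the empirical estimations of the target voters' disagreement and the source joint error.

⁸To do so, we exploit a result of Maurer (2004) that allows generalizing PAC-Bayes theorems to arbitrary bounded loss functions (see the proof of Theorem 4 in the supplemental material).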
Proof. Given $\pi$ and $\rho$ over $\mathcal{H}$, we consider a new prior $\pi^2$ and a new posterior $\rho^2$, both over $\mathcal{H}^2$, such that: $\forall h_{ij} = (h_i, h_j) \in \mathcal{H}^2$, $\pi^2(h_{ij}) = \pi(h_i)\,\pi(h_j)$, and $\rho^2(h_{ij}) = \rho(h_i)\,\rho(h_j)$. Thus, $\mathrm{KL}(\rho^2 \| \pi^2) = 2\,\mathrm{KL}(\rho \| \pi)$ (see Germain et al., 2015). Let us define two new loss functions for a "paired voter" $h_{ij} \in \mathcal{H}^2$:

$$\ell_d(h_{ij}, x, y) = I[h_i(x) \neq h_j(x)],$$
$$\text{and} \quad \ell_e(h_{ij}, x, y) = I[h_i(x) \neq y] \times I[h_j(x) \neq y].$$

Then, the bound on $d_\mathcal{T}(\rho)$ is obtained from Theorem 4 with $\ell := \ell_d$, and Equation (2). The bound on $e_\mathcal{S}(\rho)$ is similarly obtained with $\ell := \ell_e$, and using Equation (5). □

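For algorithmic simplicity, we deal with Theorem 3 when $q \to \infty$. Thanks to Corollary 5, we obtain the following generalization bound defined with respect to the empirical estimates of the target disagreement and the source joint error.

Theorem 6. For any domains $\mathcal{S}$ and $\mathcal{T}$ over $X \times Y$, any set of voters $\mathcal{H}$, any prior $\pi$ over $\mathcal{H}$, any $\delta \in (0, 1]$, any $b > 0$ and $c > 0$, with a probability at least $1 - \delta$ over the choices of $S \sim (\mathcal{S})^{m_s}$ and $T \sim (\mathcal{T}_X)^{m_t}$, we have

$$\forall \rho \text{ on } \mathcal{H}, \quad R_\mathcal{T}(G_\rho) \le c'\, \tfrac{1}{2}\, \hat{d}_T(\rho) + b'\, \hat{e}_S(\rho) + \eta_{\mathcal{T} \setminus \mathcal{S}} + \left( \frac{c'}{m_t \times c} + \frac{b'}{m_s \times b} \right) \left( 2\,\mathrm{KL}(\rho \| \pi) + \ln \frac{2}{\delta} \right),$$

where $\hat{d}_T(\rho)$ and $\hat{e}_S(\rho)$ are the empirical estimations of the target voters' disagreement and the source joint error, and $b' = \frac{b}{1 - e^{-b}}\, \beta_\infty(\mathcal{T} \| \mathcal{S})$, and $c' = \frac{c}{1 - e^{-c}}$.

Proof. We bound separately $d_\mathcal{T}(\rho)$ and $e_\mathcal{S}(\rho)$ using Corollary 5 (with probability $1 - \frac{\delta}{2}$ each), and then combine the two upper bounds according to Theorem 3. □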

From an optimization perspective, the problem suggested by the bound of Theorem 6 is much more convenient to minimize than the PAC-Bayesian bound derived from Theorem 2 in Germain et al. (2013). The former is smoother than the latter: the absolute value related to the domain disagreement $\mathrm{dis}_\rho(\mathcal{S}_X, \mathcal{T}_X)$ of Equation (4) disappears in favor of the domain divergence $\beta_\infty(\mathcal{T} \| \mathcal{S})$, which is constant and can be considered as a hyperparameter of the algorithm. Additionally, Theorem 2 requires equal source and target sample sizes, while Theorem 6 allows $m_s \neq m_t$. Moreover, recall that in Germain et al. (2013) the $\rho$-dependent non-constant term $\lambda(\rho)$ is ignored. In our new analysis, such a compromise is not mandatory in order to apply the theoretical result to real problems, since the non-estimable term $\eta_{\mathcal{T} \setminus \mathcal{S}}$ is constant and does not depend on the learned $\rho$. Hence, we can neglect $\eta_{\mathcal{T} \setminus \mathcal{S}}$ without any impact on the optimization problem described in the next section. Besides, it is realistic to consider $\eta_{\mathcal{T} \setminus \mathcal{S}}$ as a small quantity in situations where the source and target supports are similar.


7. Specialization to Linear Classifiers 

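In order to derive an algorithm, we now specialize the bounds of Theorems 3 and 6 to the risk of a linear classifier $h_\mathbf{w}$, defined by a weight vector $\mathbf{w} \in \mathbb{R}^d$:

$$\forall x \in X, \quad h_\mathbf{w}(x) = \mathrm{sign}(\mathbf{w} \cdot x).$$

The approach taken is the one favored in numerous PAC-Bayesian works (e.g., Langford & Shawe-Taylor, 2002; Ambroladze et al., 2006; McAllester & Keshet, 2011; Parrado-Hernandez et al., 2012; Germain et al., 2009; 2013), as it makes the risk of the linear classifier $h_\mathbf{w}$ and the risk of a (properly parametrized) majority vote coincide, while at the same time promoting large margin classifiers. To this end, let $\mathcal{H}$ be the set of all linear classifiers over the input space, $\mathcal{H} = \{ h_{\mathbf{w}'} \mid \mathbf{w}' \in \mathbb{R}^d \}$, and let $\rho_\mathbf{w}$ over $\mathcal{H}$ be a posterior distribution, resp. a prior distribution $\pi_\mathbf{0}$, that is constrained to be a spherical Gaussian with identity covariance matrix centered on vector $\mathbf{w}$, resp. $\mathbf{0}$,

$$\forall h_{\mathbf{w}'} \in \mathcal{H}, \quad \rho_\mathbf{w}(h_{\mathbf{w}'}) = \left( \frac{1}{\sqrt{2\pi}} \right)^{d} e^{-\frac{1}{2} \|\mathbf{w}' - \mathbf{w}\|^2}, \quad \text{and} \quad \pi_\mathbf{0}(h_{\mathbf{w}'}) = \left( \frac{1}{\sqrt{2\pi}} \right)^{d} e^{-\frac{1}{2} \|\mathbf{w}'\|^2}.$$

The KL-divergence between $\rho_\mathbf{w}$ and $\pi_\mathbf{0}$ simply is

$$\mathrm{KL}(\rho_\mathbf{w} \| \pi_\mathbf{0}) = \tfrac{1}{2} \|\mathbf{w}\|^2. \qquad (10)$$

Thanks to this parameterization, the majority vote classifier $B_{\rho_\mathbf{w}}$ coincides with the linear classifier $h_\mathbf{w}$ (see the PAC-Bayesian works cited above). That is,

$$\forall x \in X,\ \mathbf{w} \in \mathbb{R}^d, \quad h_\mathbf{w}(x) = \mathrm{sign}\Big[ \mathbf{E}_{h_{\mathbf{w}'} \sim \rho_\mathbf{w}}\, h_{\mathbf{w}'}(x) \Big] = B_{\rho_\mathbf{w}}(x).$$

Then, $R_\mathcal{D}(h_\mathbf{w}) = R_\mathcal{D}(B_{\rho_\mathbf{w}})$ for any data distribution $\mathcal{D}$. Moreover, Langford & Shawe-Taylor (2002) showed that the closely related Gibbs risk (Equation 1) is related to the linear classifier margin $y \frac{\mathbf{w} \cdot x}{\|x\|}$ as follows:

$$R_\mathcal{D}(G_{\rho_\mathbf{w}}) = \mathbf{E}_{(x,y) \sim \mathcal{D}}\, \Phi\left( y\, \frac{\mathbf{w} \cdot x}{\|x\|} \right), \qquad (11)$$

where $\Phi(x) = \frac{1}{2} \left[ 1 - \mathrm{Erf}\left( \frac{x}{\sqrt{2}} \right) \right]$, and $\mathrm{Erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt$ is the Gauss error function. Here, $\Phi(x)$ can be seen as a smooth surrogate, sometimes called the probit loss (e.g., McAllester & Keshet, 2011), of the zero-one loss function $I[x \le 0]$ relying on $y \frac{\mathbf{w} \cdot x}{\|x\|}$. Note that $\|\mathbf{w}\|$ plays an important role on the value of $R_\mathcal{D}(G_{\rho_\mathbf{w}})$, but not on $R_\mathcal{D}(h_\mathbf{w})$. Indeed, $R_\mathcal{D}(G_{\rho_\mathbf{w}})$ tends to $R_\mathcal{D}(h_\mathbf{w})$ as $\|\mathbf{w}\|$ grows, which can provide very tight bounds (see the empirical analyses of Ambroladze et al., 2006; Germain et al., 2009). In the PAC-Bayesian context, $\|\mathbf{w}\|$ turns out to be a measure of complexity of the learned classifier, as Equation (10) shows.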
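The role of the weight norm can be illustrated numerically. The sketch below assumes the standard probit form $\Phi(a) = \frac{1}{2}[1 - \mathrm{Erf}(a/\sqrt{2})]$ applied to the normalized margin $y\,\mathbf{w} \cdot x / \|x\|$; the synthetic data, the label noise, and the function names are assumptions made for the example only.

```python
import math
import numpy as np

def probit_loss(a):
    """Phi(a) = 0.5 * (1 - erf(a / sqrt(2))), a smooth surrogate of I[a <= 0]."""
    return 0.5 * (1.0 - math.erf(a / math.sqrt(2.0)))

def gibbs_risk_linear(w, X, y):
    """Average probit loss of the normalized margins y * (w . x) / ||x||."""
    margins = y * (X @ w) / np.linalg.norm(X, axis=1)
    return float(np.mean([probit_loss(m) for m in margins]))

def zero_one_risk(w, X, y):
    """Empirical risk of the deterministic linear classifier sign(w . x)."""
    return float(np.mean(np.sign(X @ w) != y))

# Synthetic data (assumption): 50 Gaussian points, 5 flipped labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w = np.array([1.0, -1.0, 0.5])
w /= np.linalg.norm(w)
y = np.sign(X @ w)
y[:5] *= -1

for scale in (1e-6, 1.0, 10.0, 1e4):
    kl = 0.5 * np.linalg.norm(scale * w) ** 2   # complexity term of the posterior
    print(scale, gibbs_risk_linear(scale * w, X, y), kl)
print(zero_one_risk(w, X, y))
```

As the scale grows, the printed Gibbs risk approaches the zero-one risk of the linear classifier while the complexity term grows quadratically, which is the trade-off that the bound has to balance.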
We now seek to express the expected disagreement d-v (pw) 
and the expected joint error e-riiPw) of Equations (2) 
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Figure 1. Graphical representation of the 
loss functions given hy the specialization 
to linear classifiers. 



Figure 2. Decision boundaries of DALC on the intertwining moons toy problem, for fixed 
parameters B=C=1, and aRBF kernel fc(x, x') = exp(—||x — x'||^). The target points 
are black. The positive, resp. negative, source points are red, resp. green. The blue 
dashed line shows the decision boundaries of algorithm PBDA (Germain et ah, 2013). 


and (5) related to the parameterized distribution p^. As 
shown in Germain et al. (2013) the former is given by 

dx>(Pw) = E $dis ( ] , 

x-X>x V l|x|| J 

where <i)dis(a^) = 2x$(a;)x<i)(— at). Following a similar 
approach, we obtain, for all w G K, 


Corollary 8. For any domains S and T over Xxy, any 
i5g(0, 1], any a>0 and 6>0, with a probability at least 1—5 
over the choices of and T^(7x)™*> we have 

VwgK: Rr(^w) < c'dT(pw) + 2fi'es(pw) + 2pr\s 


ei5(pw) = E E E l[/i(x)7^y]l[/i'(x)7^y] 

= E E I[h(x)^ 2 /] E l[/i'(x)^y] 

{x,y)~'D h~p^ h'r^p„ 

^ ^ f w ■ x\ 

~ ^ ^ -n ^ li iT ) ’ 

V I|X|| J 

with $err(;r) = As function d) in Equation (11), 

functions $err and <i)dis defined above can be interpreted as 
loss functions for linear classifiers (illustrated by Figure 1). 
Domain adaptation bound. Theorem 3 specialized to lin¬ 
ear classifiers gives the following corollary. Note that, as 
mentioned above, Rp{h-w) = Rr(^Pw) — 2Rr(G'p„). 
Corollary 7. Let S and T respectively be the source and 
the target domains on'K.xY. For all w G M, we have : 

Rr(^w) < d7-(pw) + 2/3oo(T||5) X e5(pw) + 2r77-\5 , 

Figure 1 leads to an insightful geometric interpretation of 
the domain adaptation trade-off promoted by Corollary 7. 
For fixed values of /3oo(T||5) and rjp\s, the target risk 
Rr (hw) is upper-bounded by a (/3oo-weighted) sum of two 
losses. The expected <i)err-loss {i.e., the joint error) is com¬ 
puted on the (labeled) source domain; it aims to label the 
source examples correctly, but is more permissive on the 
required margin than the $-loss (i.e., the Gibbs risk). The 
expected $dis-loss (i.e., the disagreement) is computed on 
the target (unlabeled) domain; it promotes large unsigned 
target margins. Thus, if a target domain fulfills the cluster 
assumption (described in Section 5.2), d 7 -(pw) will be low 
when the decision boundary crosses a low-density region 
between the homogeneous labeled clusters. Hence, Corol¬ 
lary 7 reflects that some source errors may be allowed if, 
doing so, the separation of the target domain is improved. 
Generalization bound and learning algorithm. Theo¬ 
rem 6 specialized to linear classifiers gives the following. 


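For a source $S=\{(x_i,y_i)\}_{i=1}^{m_s}$ and a target $T=\{x'_i\}_{i=1}^{m_t}$
samples of potentially different size, and some hyperparameters
$C>0$, $B>0$, minimizing the next objective function
w.r.t. $\mathbf{w}\in\mathbb{R}^d$ is equivalent to minimizing the above bound.

$$C\,\hat{d}_T(\rho_{\mathbf{w}}) + B\,\hat{e}_S(\rho_{\mathbf{w}}) + \|\mathbf{w}\|^2 \qquad (12)$$

$$= C\sum_{i=1}^{m_t}\Phi_{\mathrm{dis}}\!\left(\frac{\mathbf{w}\cdot x'_i}{\|x'_i\|}\right) + B\sum_{i=1}^{m_s}\Phi_{\mathrm{err}}\!\left(y_i\,\frac{\mathbf{w}\cdot x_i}{\|x_i\|}\right) + \|\mathbf{w}\|^2 .$$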
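
As a rough illustration of how Equation (12) can be minimized by gradient descent, here is a minimal sketch in plain Python. It assumes the objective is written with plain sums over examples (any $1/m$ factors folded into $C$ and $B$), the loss forms $\Phi(x) = \frac{1}{2}[1 - \mathrm{erf}(x/\sqrt{2})]$, $\Phi_{\mathrm{err}} = \Phi^2$ and $\Phi_{\mathrm{dis}}(x) = 2\Phi(x)\Phi(-x)$, and a simple step-halving rule that is a choice made here, not something specified in the text. The toy data and all names are hypothetical.

```python
import math


def phi(x):
    return 0.5 * (1.0 - math.erf(x / math.sqrt(2.0)))


def pdf(x):
    """Standard normal density; note that d/dx phi(x) = -pdf(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def unit(x):
    norm = math.sqrt(dot(x, x))
    return [v / norm for v in x]


def objective_and_gradient(w, source, target, C, B):
    """Value and gradient of C * sum(Phi_dis) + B * sum(Phi_err) + ||w||^2."""
    value = dot(w, w)
    grad = [2.0 * wi for wi in w]
    for x in target:                      # unlabeled target points
        u = unit(x)
        m = dot(w, u)
        value += C * 2.0 * phi(m) * phi(-m)
        slope = C * 2.0 * pdf(m) * (2.0 * phi(m) - 1.0)
        grad = [g + slope * ui for g, ui in zip(grad, u)]
    for x, y in source:                   # labeled source points, y in {-1, +1}
        u = unit(x)
        m = y * dot(w, u)
        value += B * phi(m) ** 2
        slope = -B * 2.0 * phi(m) * pdf(m) * y
        grad = [g + slope * ui for g, ui in zip(grad, u)]
    return value, grad


def dalc_sketch(source, target, C=1.0, B=1.0, steps=300, step_size=0.5):
    dim = len(target[0])
    w = [1.0 / dim] * dim                 # uniform starting point (an assumption)
    value, grad = objective_and_gradient(w, source, target, C, B)
    for _ in range(steps):
        candidate = [wi - step_size * gi for wi, gi in zip(w, grad)]
        new_value, new_grad = objective_and_gradient(candidate, source, target, C, B)
        if new_value < value:             # accept only improving steps
            w, value, grad = candidate, new_value, new_grad
        else:                             # otherwise shrink the step
            step_size *= 0.5
    return w, value


# Hypothetical toy data, for illustration only.
source = [((1.0, 0.2), 1), ((0.8, -0.1), 1), ((-1.0, 0.3), -1), ((-0.7, -0.4), -1)]
target = [(0.9, 0.9), (0.6, 1.0), (-0.9, -0.8), (-0.5, -1.1)]
w_star, final_value = dalc_sketch(source, target, C=1.0, B=1.0)
print(w_star, final_value)
```

Because the objective is non-convex, such a procedure only reaches a local minimum that depends on the starting point.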


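We call the optimization of Equation (12) by gradient descent
the DALC algorithm, for Domain Adaptation of Linear
Classifiers. The kernel trick applies to DALC. That is,
given a kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, one can express a linear
classifier in a RKHS⁹ by a dual weight vector $\boldsymbol{\alpha} \in \mathbb{R}^{m_s+m_t}$:

$$h_{\mathbf{w}}(\cdot) = \operatorname{sign}\left[\ \sum_{i=1}^{m_s} \alpha_i\, k(x_i, \cdot) + \sum_{i=1}^{m_t} \alpha_{i+m_s}\, k(x'_i, \cdot)\ \right].$$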

Even though the objective function is highly non-convex,
we achieved good empirical results by minimizing the “kernelized”
version of Equation (12) by gradient descent, with
a uniform weight vector as a starting point. More details
are given in the supplementary material.
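
A minimal sketch of the kernelized objective, written only in terms of the dual vector $\boldsymbol{\alpha}$ and the kernel matrix, is given below. It assumes that the margin of example $i$ is $\sum_j \alpha_j K_{i,j} / \sqrt{K_{i,i}}$ and that the regularizer is $\sum_{i,j}\alpha_i\alpha_j K_{i,j}$, which is what the expansion $\mathbf{w} = \sum_j \alpha_j \phi(x_j)$ gives; the loss forms, the toy data, the uniform value $1/m$ and the function names are assumptions made for illustration.

```python
import math


def phi(x):
    return 0.5 * (1.0 - math.erf(x / math.sqrt(2.0)))


def linear_kernel(a, b):
    return sum(u * v for u, v in zip(a, b))


def kernel_objective(alpha, points, labels, kernel, C, B):
    """points lists the source points first, then the target points;
    labels holds one label per source point."""
    m, m_s = len(points), len(labels)
    K = [[kernel(points[i], points[j]) for j in range(m)] for i in range(m)]
    # Squared RKHS norm of w = sum_j alpha_j * phi(x_j).
    value = sum(alpha[i] * alpha[j] * K[i][j] for i in range(m) for j in range(m))
    for i in range(m):
        margin = sum(alpha[j] * K[i][j] for j in range(m)) / math.sqrt(K[i][i])
        if i < m_s:                       # labeled source example: joint-error loss
            value += B * phi(labels[i] * margin) ** 2
        else:                             # unlabeled target example: disagreement loss
            value += C * 2.0 * phi(margin) * phi(-margin)
    return value


def primal_objective(w, points, labels, C, B):
    """Same objective computed directly from an explicit weight vector w."""
    m_s = len(labels)
    value = sum(wi * wi for wi in w)
    for i, x in enumerate(points):
        margin = linear_kernel(w, x) / math.sqrt(linear_kernel(x, x))
        if i < m_s:
            value += B * phi(labels[i] * margin) ** 2
        else:
            value += C * 2.0 * phi(margin) * phi(-margin)
    return value


# Hypothetical toy data: four source points followed by four target points.
points = [(1.0, 0.2), (0.8, -0.1), (-1.0, 0.3), (-0.7, -0.4),
          (0.9, 0.9), (0.6, 1.0), (-0.9, -0.8), (-0.5, -1.1)]
labels = [1, 1, -1, -1]
alpha = [1.0 / len(points)] * len(points)   # a uniform dual vector (value assumed)
w = [sum(a * x[k] for a, x in zip(alpha, points)) for k in range(2)]
print(kernel_objective(alpha, points, labels, linear_kernel, 1.0, 1.0))
print(primal_objective(w, points, labels, 1.0, 1.0))
```

With a linear kernel the dual and primal forms coincide, which gives a quick sanity check before switching to a non-linear kernel.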


8. Experimental Results 

Firstly, Figure 2 illustrates the behavior of the decision
boundary of our algorithm DALC on an intertwining moons
toy problem¹⁰, where each moon corresponds to a label.

⁹ It is non-trivial to show that the kernel trick holds when $\pi_0$
and $\rho_{\mathbf{w}}$ are Gaussian over infinite-dimensional feature space. As
mentioned by McAllester & Keshet (2011), it is, however, the
case provided we consider Gaussian processes as measure of
distributions $\pi_0$ and $\rho_{\mathbf{w}}$ over (infinite) $\mathcal{H}$.

¹⁰ We generate each pair of moons with the make_moons function
provided in scikit-learn (Pedregosa et al., 2011).

Table 1. Error rates on Amazon dataset. Best risks appear in bold
and seconds are in italic.

| Task | SVM (CV) | DASVM (RCV) | CODA (RCV) | PBDA (RCV) | DALC (RCV) |
|---|---|---|---|---|---|
| books→DVDs | *0.179* | 0.193 | 0.181 | 0.183 | **0.178** |
| books→electro | 0.290 | *0.226* | 0.232 | 0.263 | **0.212** |
| books→kitchen | 0.251 | **0.179** | 0.215 | 0.229 | *0.194* |
| DVDs→books | 0.203 | 0.202 | 0.217 | *0.197* | **0.186** |
| DVDs→electro | 0.269 | **0.186** | *0.214* | 0.241 | 0.245 |
| DVDs→kitchen | 0.232 | 0.183 | *0.181* | 0.186 | **0.175** |
| electro→books | 0.287 | 0.305 | 0.275 | **0.232** | *0.240* |
| electro→DVDs | 0.267 | **0.214** | 0.239 | *0.221* | 0.256 |
| electro→kitchen | *0.129* | 0.149 | 0.134 | 0.141 | **0.123** |
| kitchen→books | 0.267 | 0.259 | *0.247* | *0.247* | **0.236** |
| kitchen→DVDs | 0.253 | **0.198** | 0.238 | 0.233 | *0.225* |
| kitchen→electro | 0.149 | 0.157 | 0.153 | **0.129** | *0.131* |
| Average | 0.231 | *0.204* | 0.210 | 0.208 | **0.200** |

The target domain, for which we have no label, is a rotation
of the source one. The figure shows clearly that DALC
succeeds in adapting to the target domain, even for a rotation
angle of 50°. We see that DALC does not rely on the
restrictive covariate shift assumption, as some source examples
are misclassified. This behavior illustrates the DALC
trade-off in action, which concedes some errors on the source
sample to lower the disagreement on the target sample.
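
For readers who want to reproduce a similar toy setting without extra dependencies, the sketch below builds two interleaving half-circles and rotates a second draw to obtain an unlabeled target sample. It is a standard-library stand-in for scikit-learn's make_moons, not the exact generator used for Figure 2; the noise level, the sample size and the rotation about the origin are assumptions.

```python
import math
import random


def two_moons(n_per_moon, noise=0.05, seed=0):
    """Two interleaving half-circles with labels +1 and -1 (one label per moon)."""
    rng = random.Random(seed)
    points, labels = [], []
    for k in range(n_per_moon):
        t = math.pi * k / (n_per_moon - 1)
        upper = (math.cos(t), math.sin(t))
        lower = (1.0 - math.cos(t), 0.5 - math.sin(t))
        for (x, y), label in ((upper, +1), (lower, -1)):
            points.append((x + rng.gauss(0.0, noise), y + rng.gauss(0.0, noise)))
            labels.append(label)
    return points, labels


def rotate(points, degrees):
    """Rotate 2-D points counter-clockwise about the origin."""
    angle = math.radians(degrees)
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


# Labeled source sample, and an unlabeled target sample rotated by 50 degrees.
source_points, source_labels = two_moons(100)
target_points = rotate(two_moons(100, seed=1)[0], 50.0)
print(len(source_points), len(target_points))
```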

Secondly, we evaluate DALC on the classical Amazon.com
Reviews benchmark (Blitzer et al., 2006) according to the
setting used by Chen et al. (2011) and Germain et al. (2013).
This dataset contains reviews of four types of products
(books, DVDs, electronics, and kitchen appliances) described
with about 100,000 attributes. Originally, the reviews
were labeled with a rating from 1 to 5. Chen et al.
(2011) proposed a simplified binary setting by regrouping
ratings into two classes (products rated lower than 3 and
products rated higher than 4). Moreover, they reduced the
dimensionality to about 40,000 by only keeping the features
appearing at least ten times for a given domain adaptation
task. Finally, the data are pre-processed with a tf-idf
re-weighting. A domain corresponds to a kind of product.
Therefore, we perform twelve domain adaptation tasks. For
instance, “books→DVDs” is the task for which the source
domain is “books” and the target one is “DVDs”. We compare
DALC with the classical non-adaptive algorithm SVM
(trained only on the source sample), the adaptive algorithm
DASVM (Bruzzone & Marconcini, 2010), the adaptive
co-training CODA (Chen et al., 2011), and the PAC-Bayesian
domain adaptation algorithm PBDA (Germain et al., 2013)
based on Theorem 2. Note that, in Germain et al. (2013),
DASVM has shown better accuracy than SVM, CODA and
PBDA. Each parameter is selected with a grid search, using
a usual cross-validation (CV) on the source sample for
SVM, and a reverse validation procedure¹¹ (RCV)
for CODA, DASVM, PBDA, and DALC. The algorithms use
a linear kernel and consider 2,000 labeled source examples
and 2,000 unlabeled target examples. Table 1 reports the
error rates of all the methods evaluated on the same separate
target test sets proposed by Chen et al. (2011).

¹¹ For details on the reverse validation procedure, see Bruzzone
& Marconcini (2010); Zhong et al. (2010).

Above all, the adaptive approaches show the best results,
implying that tackling this problem with a domain adaptation
method is reasonable. Then, our new method DALC is the
best algorithm overall on this task. Except for the two adaptive
tasks between “electronics” and “DVDs”, DALC is either
the best one (six times), or the second one (four times).
Moreover, according to a Wilcoxon signed rank test with
a 5% significance level, we obtain a probability of 89.5%
that DALC is better than PBDA. This test tends to confirm
that our new bound improves the analysis done previously
in Germain et al. (2013), in addition to being more
interpretable.
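
The counts quoted above can be checked directly from Table 1. The short script below transcribes the error rates from the table and recomputes how often DALC ranks first or second, the per-method averages, and the number of tasks on which DALC beats PBDA; it does not reproduce the Wilcoxon test, and the variable names are illustrative.

```python
METHODS = ("SVM", "DASVM", "CODA", "PBDA", "DALC")

# Error rates transcribed from Table 1, in the order given by METHODS.
ERRORS = {
    "books->DVDs":      (0.179, 0.193, 0.181, 0.183, 0.178),
    "books->electro":   (0.290, 0.226, 0.232, 0.263, 0.212),
    "books->kitchen":   (0.251, 0.179, 0.215, 0.229, 0.194),
    "DVDs->books":      (0.203, 0.202, 0.217, 0.197, 0.186),
    "DVDs->electro":    (0.269, 0.186, 0.214, 0.241, 0.245),
    "DVDs->kitchen":    (0.232, 0.183, 0.181, 0.186, 0.175),
    "electro->books":   (0.287, 0.305, 0.275, 0.232, 0.240),
    "electro->DVDs":    (0.267, 0.214, 0.239, 0.221, 0.256),
    "electro->kitchen": (0.129, 0.149, 0.134, 0.141, 0.123),
    "kitchen->books":   (0.267, 0.259, 0.247, 0.247, 0.236),
    "kitchen->DVDs":    (0.253, 0.198, 0.238, 0.233, 0.225),
    "kitchen->electro": (0.149, 0.157, 0.153, 0.129, 0.131),
}


def rank_of(method, row):
    """Rank 1 means the lowest error on the task."""
    mine = row[METHODS.index(method)]
    return 1 + sum(1 for value in row if value < mine)


dalc_ranks = [rank_of("DALC", row) for row in ERRORS.values()]
times_best = dalc_ranks.count(1)
times_second = dalc_ranks.count(2)
averages = {
    name: sum(row[i] for row in ERRORS.values()) / len(ERRORS)
    for i, name in enumerate(METHODS)
}
wins_over_pbda = sum(1 for row in ERRORS.values() if row[4] < row[3])

print("DALC best:", times_best, "second:", times_second)
print("DALC beats PBDA on", wins_over_pbda, "of", len(ERRORS), "tasks")
for name in METHODS:
    print(f"{name:6s} {averages[name]:.4f}")
```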

9. Conclusion 

We propose a new domain adaptation analysis for majority
vote learning. It relies on an upper bound on the target
risk, expressed as a trade-off between the voters’ disagreement
on the target domain, the voters’ joint errors on the
source one, and a term reflecting the worst case error in
regions where the source domain is non-informative. To
the best of our knowledge, a crucial novelty of our contribution
is that the trade-off is controlled by the divergence
$\beta_q(\mathcal{T}\|\mathcal{S})$ (Equation 7) between the domains: the divergence
is not an additive term (as in many domain adaptation
bounds) but is a factor weighting the importance of
the source information. Our analysis, combined with a
PAC-Bayesian generalization bound, leads to a new domain
adaptation algorithm for linear classifiers. The empirical
experiments show that our new algorithm outperforms the
previous PAC-Bayesian approach (Germain et al., 2013).

As future work, we first aim at investigating the case where
the domains’ divergence $\beta_q(\mathcal{T}\|\mathcal{S})$ can be estimated, i.e.,
when the covariate shift assumption holds or when some
target labels are available. In these scenarios, $\beta_q(\mathcal{T}\|\mathcal{S})$
might not be considered as a hyperparameter to tune.

Last but not least, the term $\eta_{\mathcal{T}\setminus\mathcal{S}}$ of our bound, which suggests
that the two domains should live in the same regions,
can be dealt with by a representation learning approach. As
mentioned in Section 5.3, this could be an incentive to
combine our learning algorithm with existing representation
learning techniques. In another vein, considering an
active learning setup (as in Berlind & Urner, 2015), one
could query the labels of target examples to estimate the
value bounded by $\eta_{\mathcal{T}\setminus\mathcal{S}}$. We see this as a great source of
inspiration for new algorithms for this learning paradigm.

Other details on our experimental protocol are given in the supplementary material.

Supplementary Material
A. Proof of Theorem 4 

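Proof. We use the following shorthand notation:

$$\mathcal{L}_{\mathcal{D}}(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\ \ell(h,x,y) \quad\text{and}\quad \mathcal{L}_{S}(h) = \frac{1}{m}\sum_{(x,y)\in S} \ell(h,x,y).$$

Consider any convex function $\Delta : [0,1]\times[0,1] \to \mathbb{R}$. Applying
consecutively Jensen’s Inequality and the change of
measure inequality (see Seldin & Tishby (2010, Lemma 4)
and McAllester (2013, Equation (20))), we obtain

$$\forall \rho \text{ on } \mathcal{H}:\quad m \times \Delta\Big(\mathbb{E}_{h\sim\rho}\mathcal{L}_{S}(h),\ \mathbb{E}_{h\sim\rho}\mathcal{L}_{\mathcal{D}}(h)\Big) \le \mathbb{E}_{h\sim\rho}\ m \times \Delta\big(\mathcal{L}_{S}(h), \mathcal{L}_{\mathcal{D}}(h)\big) \le \mathrm{KL}(\rho\|\pi) + \ln X_\pi(S),$$

with

$$X_\pi(S) = \mathbb{E}_{h\sim\pi}\ e^{\,m \times \Delta(\mathcal{L}_{S}(h),\, \mathcal{L}_{\mathcal{D}}(h))}.$$

Then, Markov’s Inequality gives

$$\Pr_{S\sim\mathcal{D}^m}\left( X_\pi(S) \le \frac{1}{\delta}\ \mathbb{E}_{S'\sim\mathcal{D}^m} X_\pi(S') \right) \ge 1-\delta,$$

and

$$
\begin{aligned}
\mathbb{E}_{S'\sim\mathcal{D}^m} X_\pi(S')
&= \mathbb{E}_{S'\sim\mathcal{D}^m}\ \mathbb{E}_{h\sim\pi}\ e^{\,m \times \Delta(\mathcal{L}_{S'}(h),\, \mathcal{L}_{\mathcal{D}}(h))} \\
&= \mathbb{E}_{h\sim\pi}\ \mathbb{E}_{S'\sim\mathcal{D}^m}\ e^{\,m \times \Delta(\mathcal{L}_{S'}(h),\, \mathcal{L}_{\mathcal{D}}(h))} \\
&\le \mathbb{E}_{h\sim\pi} \sum_{k=0}^{m} \binom{m}{k} \big(\mathcal{L}_{\mathcal{D}}(h)\big)^{k} \big(1-\mathcal{L}_{\mathcal{D}}(h)\big)^{m-k} e^{\,m \times \Delta(\frac{k}{m},\, \mathcal{L}_{\mathcal{D}}(h))} \qquad (13)
\end{aligned}
$$

where the last inequality is due to Maurer (2004, Lemma
3) (we have an equality when the output of $\ell$ is in $\{0,1\}$).
As shown in Germain et al. (2009, Corollary 2.2), by fixing

$$\Delta(q,p) = -c \times q - \ln\!\big[1 - p\,(1-e^{-c})\big],$$

Line 13 becomes equal to 1, and then $\mathbb{E}_{S'\sim\mathcal{D}^m} X_\pi(S') \le 1$. Hence,

$$\Pr_{S\sim\mathcal{D}^m}\left( \forall \rho \text{ on } \mathcal{H}:\ -c\ \mathbb{E}_{h\sim\rho}\mathcal{L}_{S}(h) - \ln\!\Big[1 - \mathbb{E}_{h\sim\rho}\mathcal{L}_{\mathcal{D}}(h)\,(1-e^{-c})\Big] \le \frac{\mathrm{KL}(\rho\|\pi) + \ln\frac{1}{\delta}}{m} \right) \ge 1-\delta.$$

By reorganizing the terms, we have, with probability $1-\delta$
over the choice of $S \sim \mathcal{D}^m$,

$$\forall \rho \text{ on } \mathcal{H}:\quad \mathbb{E}_{h\sim\rho}\mathcal{L}_{\mathcal{D}}(h) \le \frac{1}{1-e^{-c}} \left[ 1 - \exp\!\left( -c\ \mathbb{E}_{h\sim\rho}\mathcal{L}_{S}(h) - \frac{\mathrm{KL}(\rho\|\pi) + \ln\frac{1}{\delta}}{m} \right) \right].$$

The final result is obtained by using the inequality
$1-\exp(-z) \le z$. □


B. Using DALC with a kernel function

Let $S = \{(x_i, y_i)\}_{i=1}^{m_s}$, $T = \{x'_i\}_{i=1}^{m_t}$, and $m = m_s + m_t$.
We will denote

$$x_{\#} = \begin{cases} x_{\#} & \text{if } \# \le m_s \quad \text{(source examples)} \\ x'_{\#-m_s} & \text{otherwise} \quad \text{(target examples).} \end{cases}$$

The kernel trick allows us to work with a dual weight vector
$\boldsymbol{\alpha} \in \mathbb{R}^{m}$ that is a linear classifier in an augmented space.
Given a kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, we have

$$h_{\mathbf{w}}(\cdot) = \operatorname{sign}\left[\ \sum_{i=1}^{m} \alpha_i\, k(x_i, \cdot)\ \right].$$

Let us denote $K$ the kernel matrix of size $m \times m$ such that
$K_{i,j} = k(x_i, x_j)$. In that case, the objective function
(Equation (12) of the main paper) can be rewritten in terms
of the vector

$$\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_m)$$

as

$$C \times \sum_{i=m_s+1}^{m} \Phi_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) + B \times \sum_{i=1}^{m_s} \Phi_{\mathrm{err}}\!\left( y_i\, \frac{\sum_{j=1}^{m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) + \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j K_{i,j}.$$

For our experiments, we minimize this objective function
using a Broyden-Fletcher-Goldfarb-Shanno method
(BFGS) implemented in the scipy python library (Jones
et al., 2001-).

We initialize the optimization procedure with a uniform weight
vector, i.e., the same value $\alpha_i$ for all $i \in \{1,\ldots,m\}$.

C. Experimental Protocol

For obtaining the results of Table 1, the reverse
validation procedure searches on a 20 × 20 parameter grid
for a C between 0.01 and 10® and a parameter B between
[…] and 10®, both on a logarithm scale. The results of the
other algorithms are reported from Germain et al. (2013).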
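
The key step in the proof of Theorem 4 (Appendix A) is that, for $\Delta(q,p) = -c\,q - \ln[1 - p\,(1-e^{-c})]$, the binomial sum of Line 13 equals 1 for a fixed voter with true loss $p$. The sketch below checks this numerically and also compares the bound obtained by reorganizing the terms with the looser one obtained from $1-\exp(-z) \le z$. The numeric values and function names are arbitrary illustrations, not taken from the experiments.

```python
import math


def delta(q, p, c):
    """Delta(q, p) = -c * q - ln(1 - p * (1 - exp(-c)))."""
    return -c * q - math.log(1.0 - p * (1.0 - math.exp(-c)))


def line_13(m, p, c):
    """Binomial sum of Line 13 for a single voter whose true loss is p."""
    return sum(
        math.comb(m, k) * p ** k * (1.0 - p) ** (m - k) * math.exp(m * delta(k / m, p, c))
        for k in range(m + 1)
    )


def theorem4_bounds(empirical_loss, kl, m, confidence, c):
    """Return (bound after reorganizing terms, bound after using 1 - exp(-z) <= z)."""
    complexity = (kl + math.log(1.0 / confidence)) / m
    tight = (1.0 - math.exp(-c * empirical_loss - complexity)) / (1.0 - math.exp(-c))
    relaxed = c / (1.0 - math.exp(-c)) * (empirical_loss + complexity / c)
    return tight, relaxed


print(line_13(m=20, p=0.3, c=2.0))
print(theorem4_bounds(empirical_loss=0.1, kl=5.0, m=1000, confidence=0.05, c=1.0))
```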

















